full model
Induced Model Matching: Restricted Models Help Train Full-Featured Models
We consider scenarios where a very accurate (often small) predictive model using restricted features is available when training a full-featured (often larger) model. This restricted model may be thought of as ``side-information'', and can come either from an auxiliary dataset or from the same dataset by forcing the restriction. How can the restricted model be useful to the full model? To answer this, we introduce a methodology called Induced Model Matching (IMM). IMM aligns the context-restricted, or induced, version of the large model with the restricted model.
Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models
Chang, Tyler A., Bergen, Benjamin K.
In Transformer language models, activation vectors transform from current token embeddings to next token predictions as they pass through the model. To isolate a minimal form of this transformation, we identify language model subnetworks that make bigram predictions, naive next token predictions based only on the current token. We find that bigram subnetworks can be found in fully trained language models up to 1B parameters, and these subnetworks are critical for model performance even when they consist of less than 0.2% of model parameters. Bigram subnetworks are concentrated in the first Transformer MLP layer, and they overlap significantly with subnetworks trained to optimally prune a given model. Mechanistically, the bigram subnetworks often recreate a pattern from the full models where the first layer induces a sharp change that aligns activations with next token predictions rather than current token representations. Our results demonstrate that bigram subnetworks comprise a minimal subset of parameters that are both necessary and sufficient for basic next token predictions in language models, and they help drive the transformation from current to next token activations in the residual stream. These subnetworks can lay a foundation for studying more complex language model circuits by building up from a minimal circuit.
Question Asking as Program Generation
Anselm Rothe, Brenden M. Lake, Todd Gureckis
A hallmark of human intelligence is the ability to ask rich, creative, and revealing questions. Here we introduce a cognitive model capable of constructing humanlike questions. Our approach treats questions as formal programs that, when executed on the state of the world, output an answer. The model specifies a probability distribution over a complex, compositional space of programs, favoring concise programs that help the agent learn in the current context. We evaluate our approach by modeling the types of open-ended questions generated by humans who were attempting to learn about an ambiguous situation in a game. We find that our model predicts what questions people will ask, and can creatively produce novel questions that were not present in the training set. In addition, we compare a number of model variants, finding that both question informativeness and complexity are important for producing human-like questions.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > Ohio (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
- Europe > United Kingdom > North Sea > Southern North Sea (0.04)
- North America > United States > New Mexico > Los Alamos County > Los Alamos (0.04)
- North America > United States > New York (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Government (0.93)
- Education (0.68)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.49)
Urban Incident Prediction with Graph Neural Networks: Integrating Government Ratings and Crowdsourced Reports
Balachandar, Sidhika, Sadhuka, Shuvom, Berger, Bonnie, Pierson, Emma, Garg, Nikhil
Graph neural networks (GNNs) are widely used in urban spatiotemporal forecasting, such as predicting infrastructure problems. In this setting, government officials wish to know in which neighborhoods incidents like potholes or rodent issues occur. The true state of incidents (e.g., street conditions) for each neighborhood is observed via government inspection ratings. However, these ratings are only conducted for a sparse set of neighborhoods and incident types. We also observe the state of incidents via crowdsourced reports, which are more densely observed but may be biased due to heterogeneous reporting behavior. First, for such settings, we propose a multiview, multioutput GNN-based model that uses both unbiased rating data and biased reporting data to predict the true latent state of incidents. Second, we investigate a case study of New York City urban incidents and collect, standardize, and make publicly available a dataset of 9,615,863 crowdsourced reports and 1,041,415 government inspection ratings over 3 years and across 139 types of incidents. Finally, we show on both real and semi-synthetic data that our model can better predict the latent state compared to models that use only reporting data or models that use only rating data, especially when rating data is sparse and reports are predictive of ratings. We also quantify demographic biases in crowdsourced reporting, e.g., higher-income neighborhoods report problems at higher rates. Our analysis showcases a widely applicable approach for latent state prediction using heterogeneous, sparse, and biased data.
- North America > United States > District of Columbia > Washington (0.04)
- North America > United States > New York > Bronx County > New York City (0.04)
- Asia > Bangladesh (0.04)
- Africa > Comoros > Grande Comore > Moroni (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
- Transportation > Ground > Road (0.67)
- Transportation > Infrastructure & Services (0.67)
- Government > Regional Government > North America Government > United States Government (0.47)
- (2 more...)
- Information Technology > Communications > Social Media > Crowdsourcing (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Contextual Sparsity with Correction for Efficient LLMs Y ang Zhou
With the blossom of large language models (LLM), inference efficiency becomes increasingly important. V arious approximate methods are proposed to reduce the cost at inference time. Contextual Sparsity (CS) is appealing for its training-free nature and its ability to reach a higher compression ratio seemingly without significant performance degradation. However, after a comprehensive evaluation of contextual sparsity methods on various complex generation tasks, we find that although CS succeeds in prompt-understanding tasks, it significantly degrades the model performance for reasoning, deduction, and knowledge-based tasks.
A Mathematical Details
We provide additional experimental results to supplement Section 6 . In Section B.1, we include In Section B.2, we present a few example outputs with a visualization In Section B.3, we include results Figure B.1 and Figure B.2 present the empirical The bottom row presents the average number of decoder layer used. The bottom row presents the average number of decoder layer used. Figure B.5 presents two example outputs of CALM for instances from the machine translation, and We observe that the textual distance generally increases as we accelerate the decoding. Interestingly, following our initial intuition, CALM distributes the compute unevenly, using very few layers for certain "easy" tokens, and additional compute to "hard" tokens.
Delta-learned force fields for nonbonded interactions: Addressing the strength mismatch between covalent-nonbonded interaction for global models
Cázares-Trejo, Leonardo, Loreto-Silva, Marco, Sauceda, Huziel E.
Noncovalent interactions--vdW dispersion, hydrogen/halogen bonding, ion-$π$, and $π$-stacking--govern structure, dynamics, and emergent phenomena in materials and molecular systems, yet accurately learning them alongside covalent forces remains a core challenge for machine-learned force fields (MLFFs). This challenge is acute for global models that use Coulomb-matrix (CM) descriptors compared under Euclidean/Frobenius metrics in multifragment settings. We show that the mismatch between predominantly covalent force labels and the CM's overrepresentation of intermolecular features biases single-model training and degrades force-field fidelity. To address this, we introduce \textit{$Δ$-sGDML}, a scale-aware formulation within the sGDML framework that explicitly decouples intra- and intermolecular physics by training fragment-specific models alongside a dedicated binding model, then composing them at inference. Across benzene dimers, host-guest complexes (C$_{60}$@buckycatcher, NO$_3^-$@i-corona[6]arene), benzene-water, and benzene-Na$^+$, \mbox{$Δ$-sGDML} delivers consistent gains over a single global model, with fragment-resolved force-error reductions up to \textbf{75\%}, without loss of energy accuracy. Furthermore, molecular-dynamics simulations further confirm that the $Δ$-model yields a reliable force field for C$_{60}$@buckycatcher, producing stable trajectories across a wide range of temperatures (10-400~K), unlike the single global model, which loses stability above $\sim$200~K. The method offers a practical route to homogenize per-fragment errors and recover reliable noncovalent physics in global MLFFs.
- North America > Mexico (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)